class: center, middle, inverse, title-slide .title[ # FASTA/Q ] .author[ ###
James Ashmore
• 23-Sep-2022 ] .institute[ ### Zifo RnD Solutions ] --- exclude: true count: false <link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet"> <link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous"> <!-- ------------ Only edit title, subtitle & author above this ------------ --> --- ## Introduction * One of the hardest things in bioinformatics is developing a **good** file format * It might seem trivial, but a lot of effort is put into developing file formats that are **efficient** yet can be **quickly** and **easily** parsed * Around the storage of primary sequences, FASTA/Q have become the de-facto standard * Many have tried to come up with *better* formats, but none have prevailed <br> <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#data/formats/xkcd-standards.png" alt="xkcd: Standards" width="50%" /> <p class="caption">xkcd: Standards</p> </div> --- ## FASTA ### Overview * Text-based format for representing either nucleotide or amino acid sequences * Nucleotides or amino acids are represented using single-letter codes: * IUPAC ambiguous DNA: `GATCRYWSMKHBVDN` * IUPAC protein alphabet of the **20** standard amino acids: `ACDEFGHIKLMNPQRSTVWY` * A FASTA file uses two or more lines per record: * The header line begins with `>` and gives a unique identifier for the sequence * The sequence line(s) contain the actual sequence and are wrapped every **80** characters * Filename extensions: `.fasta` `.fa` ```fasta >seq1 GTAGATCGCATCGACTACTACTACGTACGTACGATCGTACGTACGATCGATCGATCGATCGATCGATCGACTGAGCACTG GATGATCGATGAGCTATACAGTGTC >seq2 GATCGATCGTACGTACGTACGATCACGATCGTACGATCGATCGACTACTACTGATCGATCGATCGATCGATCGATCGTAC GTATATAGCCTTCGATCGTACGAGGGCCTCTCTCCGCGATAGATACGAGCGCGCCGATCGATCGATCG >seq3 GCTATTAGTACGATCGATCGACTATCTACGACATCGACGATCGACGATCGATCGATCGATCATCGACACACACTAGGCGA GCTACGTACGTACATCATCGACTGATCG >seq4 TTAGACTACGAGATCGATCATGGCCCCTATTAGCGCGCGCGCGCTACGACTACTTTATCGACTCTCGATATAGCGCATCG GATCATCGACGTATCGTACATTATATTATTTTTTTCGCGCTAGCATCGTACGATCTCTCTCGCGACGTACGATCGA ``` --- ## FASTA ### Header line <br> <img src="data:image/png;base64,#data/formats/fasta-identifier-coloured.jpg" width="85%" style="display: block; margin: auto;" /> <br> * The header line begins with `>` and gives a unique identifier for the sequence * The header line may also contain additional information: * NCBI Identifiers: * GenInfo integrated database `gi|integer` * GenBank `gb|accession` * Definition `text` * Organism `[organism]` --- ## FASTA ### Sequence lines(s) <br> <img src="data:image/png;base64,#data/formats/fasta-sequence-coloured.jpg" width="85%" style="display: block; margin: auto;" /> <br> * The sequence line(s) contain the actual sequence and are wrapped every **80** characters * Expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes * In genome assembly sequences, repeat regions are sometimes masked: * Sort mask: `atatatatatat` * Hard mask: `NNNNNNNNNNNN` * *Warning! Always check the wrapping on your sequences and the validation performed by the bioinformatics software - some poorly written software will only read the first 80 characters, skipping any which appear afterward!* --- ## FASTA ### Alignment representation .pull-left-50[ * The FASTA format can also be used to store multiple sequence alignment output * Sequences are padded using the `-` character to represent insertions and deletions * The FASTA format was originally designed as a protein sequence similarity searching format (FASTP) in the 1980's * Over time the need grew for a simple and flexible way to query DNA and protein sequences against entire databases ] .pull-right-50[ <img src="data:image/png;base64,#data/formats/fasta-alignment.jpg" width="100%" style="display: block; margin: auto;" /> ] --- ## Sequencing *A quick refresher before the next section...*
--- ## FASTQ ### Overview .pull-left-55[ * An extension of FASTA format to handle base quality metrics from sequencing machines * Each sequence is represented by 4 lines: 1. Identifier of the sequence starting with `@` 2. Actual sequence 3. Optional identifier starting with `+` 4. Quality scores for each base * FASTQ files are created and compressed with extension `.fastq.gz` * For each sample per flow cell lane: * Single-end: `.fastq.gz` * Paired-end: `_1.fastq.gz` `_2.fastq.gz` ] .pull-right-45[ ```fastq @HWUSI-EAS100R:6:73:941:1973#0/1 TGAAGNCTATAAACTAAGAAGCAAGCACACTAGGAGTT + AAAAA#EEEEA/EEEEEEE6EAEAEEEEEEEEEEEEEE @HWUSI-EAS100R:6:73:942:1973#0/1 GTCACNATTCTCAAGGCCGTCGTCTTTTTAGTCGGTTT + AAAAA#EEEAEEAEEEEEEEAAAEEEEEE6E/E<E/AE @HWUSI-EAS100R:6:73:943:1973#0/1 GTCGCTAAGCTCTATATAGCGCGCGGGGGTATAGCTCG + AAAAA#EEEAEEAEEEEEEEEEEEEEEEE6E/E<EEEE ``` ] --- ## FASTQ ### Sequence identifier <br> <img src="data:image/png;base64,#data/formats/fastq-identifier-coloured.jpg" width="60%" style="display: block; margin: auto;" /> <br> .pull-left-45[ * Contains information about the sequencing run and the cluster * Contents vary based on the BCL to FASTQ software version * Record starts with the `@` character * Line contains 11 elements separated by the `:` character ] .pull-right-50[ <div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:300px; "><table> <thead> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Element </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Description </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;color: #62b183 !important;"> @ </td> <td style="text-align:left;color: #62b183 !important;"> Record entry </td> </tr> <tr> <td style="text-align:left;color: #7046c8 !important;"> <instrument> </td> <td style="text-align:left;color: #7046c8 !important;"> Instrument ID </td> </tr> <tr> <td style="text-align:left;color: #76b73f !important;"> <run number> </td> <td style="text-align:left;color: #76b73f !important;"> Run number on instrument </td> </tr> <tr> <td style="text-align:left;color: #cd53bf !important;"> <flowcell ID> </td> <td style="text-align:left;color: #cd53bf !important;"> Flowcell ID </td> </tr> <tr> <td style="text-align:left;color: #4f642f !important;"> <lane> </td> <td style="text-align:left;color: #4f642f !important;"> Lane number </td> </tr> <tr> <td style="text-align:left;color: #8787c9 !important;"> <tile> </td> <td style="text-align:left;color: #8787c9 !important;"> Tile number </td> </tr> <tr> <td style="text-align:left;color: #d34d37 !important;"> <x_pos> </td> <td style="text-align:left;color: #d34d37 !important;"> X coordinate of cluster </td> </tr> <tr> <td style="text-align:left;color: #519bb4 !important;"> <y_pos> </td> <td style="text-align:left;color: #519bb4 !important;"> Y coordinate of cluster </td> </tr> <tr> <td style="text-align:left;color: #c6974c !important;"> <read> </td> <td style="text-align:left;color: #c6974c !important;"> Read number (1 or 2) </td> </tr> <tr> <td style="text-align:left;color: #5e3872 !important;"> <is filtered> </td> <td style="text-align:left;color: #5e3872 !important;"> Filtered read (Y or N) </td> </tr> <tr> <td style="text-align:left;color: #7d3e33 !important;"> <control number> </td> <td style="text-align:left;color: #7d3e33 !important;"> Control read </td> </tr> <tr> <td style="text-align:left;color: #ce6687 !important;"> <sample number> </td> <td style="text-align:left;color: #ce6687 !important;"> Sample number </td> </tr> </tbody> </table></div> ] --- ## FASTQ ### Sequence <br> <img src="data:image/png;base64,#data/formats/fastq-sequence-coloured.jpg" width="60%" style="display: block; margin: auto;" /> <br> * Contains the **base calls** for the sequenced read * Read length refers to the number of base calls: `36` * The base calls are limited to `ACGTN` * An `N` means the software was not able to make a call for this base * Multiple `N` base calls may indicate a problem with the sequencing --- ## FASTQ ### Optional identifier <br> <img src="data:image/png;base64,#data/formats/fastq-description-coloured.jpg" width="60%" style="display: block; margin: auto;" /> <br> * Contains **optional** identifier or description * Begins with a `+` character * Somtimes contains a copy of the sequence identifier --- ## FASTQ ### Quality scores <br> <img src="data:image/png;base64,#data/formats/fastq-quality-coloured.jpg" width="60%" style="display: block; margin: auto;" /> <br> .pull-left-50[ * A quality score is an integer mapping of the probability of an **incorrect** base call * Use ASCII characters to represent the numerical probability * Phred quality score calculation: `\(Q=-10\log_{{10}}P\)` ] .pull-right-45[ <table class="table" style="font-size: 15px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Phred quality score </th> <th style="text-align:right;"> Probability of incorrect base call </th> <th style="text-align:right;"> Base call accuracy </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1 in 10 </td> <td style="text-align:right;"> 90% </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 1 in 100 </td> <td style="text-align:right;"> 99% </td> </tr> <tr> <td style="text-align:right;"> 30 </td> <td style="text-align:right;"> 1 in 1000 </td> <td style="text-align:right;"> 99.9% </td> </tr> <tr> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 1 in 10,000 </td> <td style="text-align:right;"> 99.99% </td> </tr> <tr> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 1 in 100,000 </td> <td style="text-align:right;"> 99.999% </td> </tr> <tr> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 1 in 1,000,000 </td> <td style="text-align:right;"> 99.9999% </td> </tr> </tbody> </table> ] --- ## FASTQ ### Quality encoding * Different sequencing manufacturers use different quality encoding schemes * Encodings are limited to the **printable** ASCII characters (33-126) * Most data you will encounter will use **Phred+33** encoding - *but always check!* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#data/formats/quality-encoding.jpg" alt="Phred quality score encoding (Wikipedia)" width="75%" /> <p class="caption">Phred quality score encoding (Wikipedia)</p> </div> --- ## Processing FASTA/Q * There are multiple libraries and tools available to process FASTA/Q files * Because FASTA/Q files can be very large, there has been a lot of effort devoted to developing faster and faster algorithms for reading such files * See this [blog post](http://lh3.github.io/2020/05/17/fast-high-level-programming-languages) by Heng Li if you are *algorithimically inclined* * Examples of some popular libraries: * [Biopython](https://biopython.org) * [Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html) and [ShortRead](https://bioconductor.org/packages/release/bioc/html/ShortRead.html) from [Bioconductor](https://www.bioconductor.org) * Examples of some popular tools: * [Seqkit](https://bioinf.shenwei.me/seqkit/) * [seqtk](https://github.com/lh3/seqtk) --- ## Biopython ### Overview <img src="data:image/png;base64,#data/formats/biopython-logo.png" width="20%" style="display: block; margin: auto 0 auto auto;" /> * Biopython is a set of freely available tools for biological computation written in Python by an international team of developers * It is a distributed collaborative effort to develop Python libraries and applications which address the needs of current and future work in bioinformatics * It contains classes to represent biological sequences and sequence annotations, and it is able to read and write to a variety of file formats * It also allows for a programmatic means of accessing online databases of biological information, such as those at NCBI * Separate modules extend Biopython's capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning --- ## Biopython ### Tour <iframe src="https://biopython.org" width="100%" height="500px" data-external="1"></iframe> --- ## Biopython ### Parsing FASTA ```python from Bio import SeqIO for record in SeqIO.parse("data/formats/virus.fasta", "fasta"): print("Header:", record.id) print("Sequence :", repr(record.seq)) print("Length:", len(record)) print("---") ``` ``` ## Header: lcl|EU163411.1_prot_ABW06765.1_1 ## Sequence : Seq('MATKCDLGRHHWMRADNAVCVRPLVPETTSNLLTRWFVSRYEAGELPSKGYMSV...VVS') ## Length: 94 ## --- ## Header: lcl|EU163411.1_prot_ABW06766.1_2 ## Sequence : Seq('MAQNGTGGGSRRPRRGRRNNNNNNSTARDKALLALTQQVNRLANIASSNAPSLQ...PTR') ## Length: 79 ## --- ## Header: lcl|AF170445.1_prot_AAD55052.1_1 ## Sequence : Seq('MKALLAWAVIAIALYQPVAAENITQWNLSDNGTSGIQRAMYLRGVNRSLHGIWP...TGK') ## Length: 233 ## --- ``` --- ## Biopython ### Parsing FASTQ ```python from Bio import SeqIO total = 0 count = 0 for record in SeqIO.parse("data/formats/reads.fastq", "fastq"): total += 1 min_phred_quality = min(record.letter_annotations["phred_quality"]) if min_phred_quality >= 20: count += 1 print("Total:", total) print("Passed:", count) print("Percent:", count / total) ``` ``` ## Total: 100 ## Passed: 50 ## Percent: 0.5 ``` --- ## Bioconductor ### Overview <img src="data:image/png;base64,#data/formats/bioconductor-logo.png" width="30%" style="display: block; margin: auto 0 auto auto;" /> * Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology * Bioconductor is based primarily on the statistical R programming language, but does contain contributions in other programming languages * Most Bioconductor components are distributed as R packages, which are add-on modules for R * The **Biostrings** and **ShortRead** packages can be used for FASTA/Q processing --- ## Bioconductor ### Tour <iframe src="https://bioconductor.org" width="100%" height="500px" data-external="1"></iframe> --- ## Bioconductor ### Parsing FASTA ```r library(Biostrings) records <- readDNAStringSet("data/formats/virus.fasta", format = "fasta") print(records) print(paste("Total:", length(records))) print(paste("Length(s):", paste(width(records), collapse = ","))) print("Alphabet Frequency Table:") alphabetFrequency(records) ``` ``` ## DNAStringSet object of length 3: ## width seq names ## [1] 70 MATKCDGRHHWMRADNAVCVRVTTSNTRWVSRYAGSKGYMSVVCAVTRTVTSDAGSRYADGDKRADAVVS lcl|EU163411.1_pr... ## [2] 62 MANGTGGGSRRRRGRRNNNNNNSTARDKAATVNRANASSNASHTASKKCRAGYTYTSDVRTR lcl|EU163411.1_pr... ## [3] 168 MKAAWAVAAYVAANTWNSDNGTSGRAMYRGVNRSHGWK...SVDYGDHCGSDTAYDGMTNTKARGAARVTSWGRSTTGK lcl|AF170445.1_pr... ## [1] "Total: 3" ## [1] "Length(s): 70,62,168" ## [1] "Alphabet Frequency Table:" ## A C G T M R W S Y K V H D B N - + . ## [1,] 9 3 5 7 3 8 2 7 3 3 10 2 6 0 2 0 0 0 ## [2,] 9 1 6 7 1 11 0 7 2 3 2 1 2 0 10 0 0 0 ## [3,] 18 9 18 23 6 14 8 9 7 10 12 5 10 0 19 0 0 0 ``` --- ## Bioconductor ### Parsing FASTQ ```r library(ShortRead) records <- readFastq("data/formats/reads.fastq", format = "fastq") print(records) print(quality(records)) ``` ``` ## class: ShortReadQ ## length: 100 reads; width: 35..38 cycles ## class: FastqQuality ## quality: ## BStringSet object of length 100: ## width seq ## [1] 35 ################################### ## [2] 38 AAAAA#EEEEA/EEEEEEE6EAEAEEEEEEEEEEEEEE ## [3] 38 AAAAA#EEEAEEAEEEEEEEAAAEEEEEE6E/E<E/AE ## [4] 38 AAAAA#EEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEE ## [5] 37 AAAAA#EEEEEEEEEEEEAEEEE/EEEEEEEEEEEEE ## ... ... ... ## [96] 38 AAAAA/EEEEEE/EEEEE<EEEE/EAEEEAEEAEEEE/ ## [97] 38 AAAAAEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEE ## [98] 38 AAAAAEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEAE ## [99] 37 /AAAAA/AEEEAA/EEEAEE<A///E/E/AEEE//EE ## [100] 37 AAAAAAEEAEEEEEA<EEEAEEEEEEEEAEEEEEEEE ``` --- ## SeqKit ### Overview * SeqKit is a cross-platform ultrafast comprehensive toolkit for FASTA/Q processing * SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OSX * Seqkit can be used to process FASTA/FASTQ files in the following ways: * [Transform](https://bioinf.shenwei.me/seqkit/usage/#seq) sequences (extract ID, filter by length, remove gaps...) * [Statistics](https://bioinf.shenwei.me/seqkit/usage/#stats) (seqs, min/max_len, N50, Q20%, Q30%…) * [Search](https://bioinf.shenwei.me/seqkit/usage/#grep) sequences by identifier, name, sequence, and sequence motifs * [Locate](https://bioinf.shenwei.me/seqkit/usage/#locate) sub-sequences and motifs --- ## SeqKit ### Tour <iframe src="https://bioinf.shenwei.me/seqkit/" width="100%" height="500px" data-external="1"></iframe> --- ## SeqKit ### Examples * Produce an overview of FASTA/Q files: ```bash seqkit stats data/formats/virus.fasta data/formats/reads.fastq ``` ``` ## file format type num_seqs sum_len min_len avg_len max_len ## data/formats/virus.fasta FASTA Protein 3 406 79 135.3 233 ## data/formats/reads.fastq FASTQ DNA 100 3,762 35 37.6 38 ``` * Get GC content of every sequence in FASTA/Q file: ```bash seqkit fx2tab --name --only-id --gc data/formats/virus.fasta ``` ``` ## lcl|EU163411.1_prot_ABW06765.1_1 8.51 ## lcl|EU163411.1_prot_ABW06766.1_2 8.86 ## lcl|AF170445.1_prot_AAD55052.1_1 11.59 ``` * Locate motif in FASTA/Q sequences: ```bash seqkit locate --pattern "AACGCA" data/formats/reads.fastq ``` ``` ## seqID patternName pattern strand start end matched ## SRR4413869.67 AACGCA AACGCA - 13 18 AACGCA ## SRR4413869.96 AACGCA AACGCA + 22 27 AACGCA ``` --- ## Summary * FASTA is used store sequence information from DNA, RNA or protein sequences * FASTQ is used to store DNA output from sequencing machines * FASTQ file format is an extension of FASTA containing quality scores for the base calls * FAST/Q file formats can be processed using a variety of libraries and tools * Biopython/Bioconductor can be used to access and manipulate FASTA/Q files * Seqkit/seqtk have many useful functions (parse, edit, search, convert) to process FASTA/Q files --- ## Resources - [FASTA & FASTQ file formats](https://compgenomr.github.io/book/fasta-and-fastq-formats.html) - [Interpreting quality scores from FASTQ files](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm) <!-- --------------------- Do not edit this and below --------------------- --> --- name: end_slide class: end-slide, middle count: false # Thank you. Questions? .end-text[ <p class="smaller"> <span class="small" style="line-height: 1.2;">Graphics from </span><img src="./assets/freepik.jpg" style="max-height:20px; vertical-align:middle;"><br> Created: 23-Sep-2022 • James Ashmore • <a href="https://www.zifornd.com/category/omics-bioinformatics">Bioinformatics</a> • <a href="https://www.zifornd.com">Zifo</a> </p> ]